Day 1: Heterogeneity and Dynamics in Data

Linear Models and Summary

Robert W. Walker

2025-07-20

Core Details

  • Who am I? Why am I here? What will I learn?
  • How will we do this? The course has a website. Built in R.
  • Information on backgrounds
  • Discussion of syllabus. It’s all on the box.

Preliminaries

  • An Ode to Harold
  • Reader: We have a course textbook and essential articles for the course online. The syllabus outlines issues from textbooks that provide more detailed foundations; one can learn a great deal from walking through the fine details. Near the end of each day, I will signpost which of the readings to focus on for the following day.
  • Questions: Ask them. Please.
  • Diction: I speak (VERY) quickly. Remind me when this is excessive (ALWAYS?); I will undoubtedly forget and this doesn’t help anyone (including me as I burn through slides).

The course: One then Multiple Time Series

  • Data analysis is all about variation – random variables.
  • We usually want parameters that usefully describe that variation in a social science world.
  • Sources of variation are our key concern and their influence on our parameters. This comes in two classes.
    • Uncorrelated with regressors.
    • Correlated with regressors.
      • Pooling admits two obvious sources of variation: i and t (however defined).
  • All our concerns follow from this. First t.

Data Examples

What sort of data do you work with that have multiple units observed over time?

  • The time series of Oregon’s Bond rating. \rightarrow Bond ratings among U. S. states.
  • Nine justices vote on a case \rightarrow US Supreme Court justices/votes
  • a dyad fights \rightarrow Wars/Military conflicts
  • OECD countries/global samples in political economy.
  • FDI inflows/outflows, etc.
  • Voter turnout across nations or US states/counties/municipalities.

Some Basic Things

  • Time is complicated.
  • Time-zones
  • Daylight saving time
  • Irregular periodicity

Fortunately, the computer can mostly handle this, but the structure has to be declared. Stata has tsset and R has the tsibble structure that I will use, among many others.

Before We Start, A Review?

  • Matrix Algebra
  • What are matrices?
    • Matrix multiplication/addition/etc.
  • Matrix Inversion?
    • Trace, Kronecker Product, Eigenvalues, Determinants, and the like
  • Statistics – Expectations and Normal, \chi^2, t, and F
    • Linear Regression
  • What assumptions are necessary for the OLS estimator to be unbiased?
    • How about the asymptotic variance to be correct?
    • What about the Gauss-Markov conditions?

Why Matrices?

  • A matrix is the natural structure for panel data/CSTS/TSCS.
  • Individuals/households/countries form rows.
  • Time points form columns.

On Matrices

A matrix is a rectangular array of real numbers. If it has m rows and n columns, we say it is of dimension m\times n.

A = \left( \begin{array}{cccc} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{array} \right)

A vector (x_{1}, x_{2}, \ldots, x_{n}) \in \mathbb{R}^{n} can be thought of as a matrix with n rows and one column or as a matrix with one row and n columns.

Products

Periodically, we will wish to make use of two types of vector products.

  • Inner (or dot) product Let \mathbf{u} = (u_{1}, \ldots, u_{n}) and \mathbf{v} = (v_{1}, \ldots, v_{n}) be two vectors in \mathbb{R}^{n}. The Euclidean inner product of \mathbf{u} and \mathbf{v}, written as \mathbf{u} \cdot \mathbf{v}, is the number \mathbf{u}\cdot\mathbf{v} = u_{1}v_{1} + u_{2}v_{2} + \ldots + u_{n}v_{n} which can also be written \mathbf{u}^{\prime}\mathbf{v}
  • Outer product \mathbf{u}\mathbf{v}^{\prime} is the n \times n matrix whose (i,j) entry is the product u_{i}v_{j}.

\mathbf{uv}^{\prime} = \left( \begin{array}{cccc} u_{1}v_{1} & u_{1} v_{2} & \ldots & u_{1} v_{n} \\ u_{2}v_{1} & u_{2}v_{2} & \ldots & u_{2}v_{n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{n}v_{1} & u_{n}v_{2} & \ldots & u_{n} v_{n} \end{array} \right)
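A quick sketch in base R (toy vectors of my own choosing):

```r
# Inner and outer products for two vectors in R^3
u <- c(1, 2, 3)
v <- c(4, 5, 6)

# Inner (dot) product u'v: a scalar
inner  <- sum(u * v)           # 1*4 + 2*5 + 3*6 = 32
inner2 <- drop(t(u) %*% v)     # same thing via matrix multiplication

# Outer product uv': an n x n matrix whose (i,j) entry is u_i * v_j
outer.uv <- u %*% t(v)         # base R also offers outer(u, v)
outer.uv
```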

Matrix Inversion

There are two primary methods for inverting matrices: the first is often referred to as Gauss-Jordan elimination and the second as Cramer’s rule. The former applies a series of elementary row operations to the matrix of interest and, in parallel, to the identity matrix I, while the latter relies on the determinant and the adjoint of the matrix of interest.

Let A and B be square invertible matrices. It follows that:

  • (A^{-1})^{-1} = A
  • (A^{T})^{-1} = (A^{-1})^{T}
  • AB is invertible and (AB)^{-1} = B^{-1}A^{-1}.

For a square matrix A, the following are equivalent:

  1. A is invertible.
  2. A is nonsingular.
  3. For all y, Ax=y has a unique solution.
  4. \det A \neq 0.
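These properties are easy to verify numerically with solve() (a small example of my own construction):

```r
# Verify the inverse identities on small invertible matrices
A <- matrix(c(2, 1, 1, 3), 2, 2)
B <- matrix(c(1, 0, 2, 1), 2, 2)

Ainv <- solve(A)
max(abs(solve(Ainv) - A))                    # (A^-1)^-1 = A
max(abs(solve(t(A)) - t(Ainv)))              # (A')^-1 = (A^-1)'
max(abs(solve(A %*% B) - solve(B) %*% Ainv)) # (AB)^-1 = B^-1 A^-1
det(A)                                       # nonzero, so A is invertible
```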

Definiteness

Given a square matrix A and a vector \mathbf{x}, we can claim that

  • A is negative definite if, \forall x \neq 0,\; x^{T}Ax < 0
  • A is positive definite if, \forall x \neq 0,\; x^{T}Ax > 0
  • A is negative semi-definite if, \forall x \neq 0,\; x^{T}Ax \leq 0
  • A is positive semi-definite if, \forall x \neq 0,\; x^{T}Ax \geq 0

Brief diversion on principal submatrices and (leading) principal minors toward a sufficient condition for characterizing definiteness.

NB: The trace of a square matrix is the sum of its diagonal elements.
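Rather than computing leading principal minors by hand, a quick numerical check for a symmetric matrix uses its eigenvalues (all positive iff positive definite, and so on for the other cases); a small sketch:

```r
# Definiteness of a symmetric matrix via its eigenvalues:
# all > 0 -> positive definite; all < 0 -> negative definite;
# weak inequalities give the semi-definite cases.
A  <- matrix(c(2, 1, 1, 2), 2, 2)        # symmetric toy matrix
ev <- eigen(A, symmetric = TRUE)$values  # 3 and 1 here
all(ev > 0)                              # TRUE: A is positive definite
sum(diag(A))                             # the trace, 4, which also equals sum(ev)
```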

Enough Matrices: To Statistics

We needed this to:

  • Determine invertibility and the relation to definiteness.
  • A matrix version of the Cauchy-Schwarz inequality sets the characteristics of variance/covariance matrices for linear regression problems (avoiding equality).
  • To build familiarity, because error (variance/covariance) matrices are useful objects to think about.
  • Linking time series properties to invertibility through the preceding results is key, especially because of block inversion.

Random Variables, Expectations, etc. etc.

  • Random Variables: Real-valued function with domain: a sample space.

  • Mean (Expected Value): E[x] = \int_{x} x\; f(x)\; dx or E[x] = \sum_{x} x\; p(x)

  • Variance (Spread): V[x] = E(x^2) - [E(x)]^2

  • Covariance: Cov(x,y) = E[(x_{i} - [E(x)])(y_{i} - [E(y)])]

  • Correlation: \rho = \frac{E[(x_{i} - [E(x)])(y_{i} - [E(y)])]}{\sigma_{x}\sigma_{y}} = \frac{Cov(x,y)}{\sqrt{\sigma^{2}_{x}}\sqrt{\sigma^{2}_{y}}}

  • Variance of a Linear Combination: V\left[\sum_{i} a_{i}X_{i}\right] = \sum_{i}\sum_{j} a_{i}a_{j}Cov(X_{i},X_{j})
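These moment identities can be checked by simulation (toy data of my own choosing; small discrepancies reflect sampling noise and the n versus n-1 denominators):

```r
# Numerical check of the moment identities
set.seed(42)
x <- rnorm(1e5, mean = 2, sd = 3)
y <- 0.5 * x + rnorm(1e5)

v.x <- mean(x^2) - mean(x)^2               # V[x] = E[x^2] - (E[x])^2
cv  <- mean((x - mean(x)) * (y - mean(y))) # covariance as an expected product of deviations
rho <- cv / (sd(x) * sd(y))                # correlation
c(v.x, var(x))                             # both near sigma^2 = 9
c(rho, cor(x, y))                          # essentially identical
```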

Some (Loosely Stated) Distributional Results

  • Self-reproducing property of N
  • The implications of Basu’s Theorem (Independence of Mean and Variance of Normals)
  • N^2 \sim \chi^2
    • \frac{\chi^2_m / m}{\chi^2_n / n} \sim F_{m,n}
    • t is given by \frac{N}{\sqrt{\frac{\chi^2_{v}}{v}}}

Normal random variables

A random variable X has a normal distribution (with \mu \in \mathbb{R} and \sigma^{2} \in \mathbb{R}^{++}) if X has a continuous distribution for which the probability density function (p.d.f.) f(x|\mu,\sigma^2 ) is as follows (for finite x): f(x|\mu,\sigma^2 ) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp \left[ -\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2}\right]

  • If X \sim N(\mu,\sigma^2) and Y = aX + b (a \neq 0), then Y \sim N(a\mu + b, a^{2}\sigma^{2}).
  • Z \sim N(0,1) allows the percentiles for any normal: Z = \frac{x - \mu}{\sigma}.
  • Sums of (independent) normals are normal.
  • Sums of affine transformations of (independent) normals are normal.

Basu’s Theorem

Let X_1, X_2, \ldots, X_n be independent, identically distributed normal random variables with mean \mu and variance \sigma^2. With respect to \mu, \widehat{\mu}=\frac{\sum X_i}{n}, the sample mean is a complete sufficient statistic – it is informationally optimal to estimate \mu. \widehat{\sigma}^2=\frac{\sum \left(X_i-\bar{X}\right)^2}{n-1}, the sample variance, is an ancillary statistic – its distribution does not depend on \mu.

These statistics are independent (also can be proven by Cochran’s theorem). This property (that the sample mean and sample variance of the normal distribution are independent) characterizes the normal distribution; no other distribution has this property.

\chi^2 random variables

If a random variable X has a \chi^2 distribution with n degrees of freedom, then the probability density function of X (for x > 0) is f(x) = \frac{1}{2^{\frac{n}{2}}\Gamma(\frac{n}{2})}x^{\frac{n}{2} - 1}\exp\left(\frac{-x}{2}\right)

Two key properties:

  1. If the random variables X_{1},\ldots,X_{k} are independent and X_{i} has a \chi^2 distribution with n_{i} degrees of freedom, then \sum_{i=1}^{k} X_{i} has a \chi^{2} distribution with \sum_{i=1}^{k} n_{i} degrees of freedom.
  2. If the random variables X_{1},\ldots,X_{k} are independent and, \forall i: X_{i} \sim N(0,1), then \sum_{i=1}^{k} X^{2}_{i} has a \chi^{2} distribution with k degrees of freedom.
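The second property is easy to see by simulation (a sketch with an arbitrary k of my own choosing):

```r
# A sum of k squared independent standard normals is chi-squared with k df
set.seed(1)
k <- 5
draws <- replicate(1e4, sum(rnorm(k)^2))
mean(draws)   # near k = 5, the chi-squared mean
var(draws)    # near 2k = 10, the chi-squared variance
```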

Student’s t distributed random variables

Consider two independent random variables Y and Z, such that Y has a \chi^{2} distribution with n degrees of freedom and Z has a standard normal distribution. If we define X = \frac{Z}{\sqrt{\frac{Y}{n}}} then the distribution of X is called the t distribution with n degrees of freedom. The t has density (for finite x) f(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{(n\pi)^{\frac{1}{2}}\Gamma\left(\frac{n}{2}\right)}\left(1 + \frac{x^{2}}{n}\right)^{\frac{-(n+1)}{2}}

F distributed random variables (Variance-Ratio)

Consider two independent random variables Y and W, such that Y has a \chi^{2} distribution with m degrees of freedom and W has a \chi^{2} distribution with n degrees of freedom, where m, n \in \mathbb{R}^{++}. We can define a new random variable X as follows: X = \frac{\frac{Y}{m}}{\frac{W}{n}} = \frac{nY}{mW} then the distribution of X is called an F distribution with m and n degrees of freedom.

Distributions and Matrices

This gets us through the background to this point. We will invoke parts of this as we go from here. We have matrices and their inverses. We have distributional results that link the normal, \chi^2, t, and F. We have Basu’s theorem on the independence of the sample mean and variance of normals. This gives us the intuition for Gauss-Markov. Nevertheless, let’s begin the meat of it all: regression.

The Orthogonal Projection

One of the features of the Ordinary Least Squares estimator is the orthogonality (uncorrelatedness) of the estimation space and the error space.

  • E[\hat{e}_{i}\hat{y}_{i}] = 0
  • E[\hat{e}_{i}{x}_{i}] = 0

OLS: Assumptions

  1. Linearity: y = X\beta + \epsilon
  2. Strict Exogeneity E[\epsilon | X] = 0
  3. No [perfect] multicollinearity: the rank of the N \times K data matrix X is K with probability 1 (N > K).
  4. X is a nonstochastic matrix.
  5. Homoskedasticity E[\epsilon\epsilon^{\prime}] = \sigma^2I s.t. \sigma^{2} > 0

Returning to the Regression Model

Now we want to reexamine the minimization of the sum of squared errors in a matrix setting. We wish to minimize the inner product of \epsilon^{\prime}\epsilon. \epsilon^{\prime}\epsilon = (y - X\beta)^{\prime}(y - X\beta) = y^{\prime}y - y^{\prime}X\beta - \beta^{\prime}X^{\prime}y + \beta^{\prime}X^{\prime}X\beta

Take the derivative, set it equal to zero, and solve…. \frac{\partial \epsilon^{\prime}\epsilon}{\partial \beta} = -2X^{\prime}y + 2X^{\prime}X\beta = 0 X^{\prime}y = X^{\prime}X\beta
So we rearrange to obtain the solution in matrix form.

\hat{\beta}_{OLS} = (X^{\prime}X)^{-1}X^{\prime}y
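The closed form can be computed directly and checked against lm() (simulated data with coefficients of my own choosing):

```r
# OLS "by hand" via (X'X)^{-1} X'y, checked against lm()
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                         # n x k design matrix, first column 1
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(matrix = drop(beta.hat), lm = coef(lm(y ~ x1 + x2)))  # identical to numerical precision
```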

Properties of OLS Estimators

  • Unbiasedness is E(\hat{\beta} - \beta) = 0.
  • Variance E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^{\prime}].
  • The Gauss-Markov Theorem – Minimum Variance Unbiased Estimator.

The first two

Need nothing about the distribution other than the two moment definitions. It is for the third that the distribution starts to matter and, in many ways, this is directly a reflection of Basu’s theorem.

Unbiasedness

With \hat{\beta}=(X^{\prime}X)^{-1}X^{\prime}y, unbiasedness, \mathbb{E}[\hat{\beta} - \beta] = 0, requires \mathbb{E}[(X^{\prime}X)^{-1}X^{\prime}y - \beta] = 0 We require an inverse already. Invoking the definition of y, we get \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}(\mathbf{X}\beta + \epsilon) - \beta] = 0 \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta] = 0 Since \mathbf{(X^{\prime}X)^{-1}X^{\prime}X} = \mathbf{I}, taking expectations and rearranging leaves \mathbb{E}[\hat{\beta} - \beta] = \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon] If the latter term is zero, all is well.

Variance

\mathbb{E}[(\hat{\mathbf{\beta}} - \beta)(\hat{\mathbf{\beta}} - \beta)^{\prime}] can be derived as follows. \mathbb{E}[(\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)(\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\mathbf{X}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)^{\prime}] \mathbb{E}[(\mathbf{I}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)(\mathbf{I}\beta + \mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon - \beta)^{\prime}] Recognizing the zero part from before, we are left with the manageable,

\mathbb{E}[(\hat{\mathbf{\beta}} - \beta)(\hat{\mathbf{\beta}} - \beta)^{\prime}] = \mathbb{E}[\mathbf{(X^{\prime}X)^{-1}X^{\prime}}\epsilon\epsilon^{\prime}\mathbf{X(X^{\prime}X)^{-1}}] Nifty. With nonstochastic \mathbf{X}, it’s the structure of \epsilon\epsilon^{\prime} that matters, and we know what that is: by assumption, \sigma^{2}\mathbf{I}. If X is stochastic, we need more steps to get to the same place.

Gauss-Markov

Proving the Gauss-Markov theorem is not so instructive. Restricted to linear unbiased estimators, any alternative adds something to the OLS weights; after computation, its variance is the OLS variance plus a positive semi-definite matrix. OLS always wins. From here, a natural place to go is corrections for error variance matrices that are not \sigma^{2}\mathbf{I}. We will do plenty of that. And we will eventually need Aitken.

Special Matrices

Beyond this, let’s take up two special matrices (that will become your favorite matrices):

  1. Projection Matrix : \mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}

  2. Residual Maker : \mathbf{I} - \mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}

which are both symmetric and idempotent (\mathbf{M}^{2}=\mathbf{M}).

On M and P

M

\mathbf{M} = \mathbf{I} - \mathbf{X(X^\prime X)^{-1}X^\prime}
\mathbf{My} = (\mathbf{I} - \mathbf{X(X^\prime X)^{-1}X^\prime})\mathbf{y}
\mathbf{My} = \mathbf{Iy} - \mathbf{X}\underbrace{\mathbf{(X^\prime X)^{-1}X^\prime y}}_{\hat{\beta}}
\mathbf{My} = \mathbf{y} - \mathbf{X\hat{\beta}}
\mathbf{My} = \hat{\epsilon}

P

\mathbf{P} = \mathbf{I - M}
\mathbf{P} = \mathbf{I - (I - X(X^\prime X)^{-1}X^\prime})
\mathbf{P} = \mathbf{X(X^\prime X)^{-1}X^\prime}
\mathbf{Py} = \mathbf{X}\underbrace{\mathbf{(X^\prime X)^{-1}X^\prime y}}_{\hat{\beta}}
\mathbf{Py} = \mathbf{X\hat{\beta}}
\mathbf{Py} = \hat{\mathbf{y}}
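Both matrices are easy to construct and verify in R (simulated data; P and M as defined above):

```r
# The projection matrix P and the residual maker M on a toy regression
set.seed(3)
n <- 50
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2) + rnorm(n))

P <- X %*% solve(t(X) %*% X) %*% t(X)
M <- diag(n) - P
fit <- lm(y ~ X - 1)

max(abs(P %*% P - P))             # idempotent (as is M)
max(abs(P %*% y - fitted(fit)))   # Py = fitted values
max(abs(M %*% y - resid(fit)))    # My = residuals
max(abs(M %*% P))                 # the two spaces are orthogonal
```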

General Linear Regression Model: Multiple Regression

y = X\beta + \epsilon

  1. X is an n \times k matrix of regressors (where the first column is 1)
  2. \beta is a k \times 1 vector of unknown (partial) slope coefficients
  3. \epsilon is an (unknown) n \times 1 vector of disturbances
  • A partial slope is the analog of our simple bivariate regression except that there are multiple regressors.
  • Partial: it is the effect of x_{k} when all other variables are held constant.

Regression and Inference

y = X\beta + \epsilon

  1. Linear regression model (or sort of):
    • Same tools for testing hypotheses
    • Tests work for multiple partial slopes
    • t can test the hypothesis of no relationship between x_k and y; \beta_k = 0
    • F can compare the fit of two models or the joint hypothesis that \beta_1 = \beta_2 = \ldots = \beta_k = 0.

The latter F test appears in standard regression output; the reported probability compares the model you have estimated to a model with only a constant (H_0: constant only). If we just like F better than t, it happens that (t_{\nu})^2 \sim F_{1,\nu}, but t has the advantage of being (potentially) one-sided.

plot(density(rt(100000,df=10)^2), xlab="t-squared/F", bty="n", main="")
lines(density(rf(100000, 1, 10)), col=2)
legend(x=20, y=0.8, legend=c(expression(t^2),expression(F[1-10])), col=c(1,2), bty="n", lty=1)

Confidence Intervals

In forming confidence intervals, one must keep track of the metric. t is defined on a standard-deviation scale, and the standard deviation shares a common metric with the parameter \hat{\beta}; the variance is in the squared units of \hat{\beta}. As a result, we will form confidence intervals from standard deviations instead of variances.

  1. Prediction intervals: A future response: \hat{y}_{0} \pm t^{(\frac{\alpha}{2})}_{n-p}\hat{\sigma}\sqrt{1 + x^{\prime}_{0}(\mathbf{X^{\prime}X})^{-1}x_{0}}

  2. A mean response: \hat{y}_{0} \pm t^{(\frac{\alpha}{2})}_{n-p}\hat{\sigma}\sqrt{x^{\prime}_{0}(\mathbf{X^{\prime}X})^{-1}x_{0}}
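R’s predict() produces both intervals (hypothetical data; "confidence" is the mean response, "prediction" the future response):

```r
# Mean-response vs. prediction intervals at x0 = 0
set.seed(11)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)
x0  <- data.frame(x = 0)

mean.ci <- predict(fit, x0, interval = "confidence")  # mean response
pred.ci <- predict(fit, x0, interval = "prediction")  # future response
mean.ci
pred.ci
# The prediction interval is strictly wider: the extra "1 +" under the root
```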

\hat{\beta} Confidence Intervals

  • Individual: \hat{\beta} \pm t^{(\frac{\alpha}{2})}_{n-p}\hat{\sigma}\sqrt{(\mathbf{X^{\prime}X})^{-1}_{ii}} Unfortunately, unless the off-diagonal elements of the variance/covariance matrix of the estimates are zero, individual confidence intervals are/can be deceptive. A better way is to construct a simultaneous confidence interval.
  • Simultaneous for k regressors: (\mathbf{\hat{\beta} - \beta})^{\prime}\mathbf{X^{\prime}X}(\mathbf{\hat{\beta} - \beta}) \leq k \hat{\sigma}^{2}F^{(\alpha)}_{k,n-k}

Omitted Variable Bias

Suppose that the correct specification is \mathbf{y} = \mathbf{X_{1}\beta_{1}} + \mathbf{X_{2}\beta_{2}} + \mathbf{\epsilon} where \mathbf{X_{1}} consists of k_{1} columns and \mathbf{X_{2}} consists of k_{2} columns. Regressing \mathbf{y} on just \mathbf{X_{1}} without including \mathbf{X_{2}}, we can characterize \mathbf{b_{1}}, \mathbf{b_{1}} = (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}y} \rightarrow (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}}[\mathbf{X_{1}\beta_{1}} + \mathbf{X_{2}\beta_{2}} + \mathbf{\epsilon}] = (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}}\mathbf{X_{1}\beta_{1}} + (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}}\mathbf{X_{2}\beta_{2}} + (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}}\mathbf{\epsilon} = \mathbf{\beta_{1}} + (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}}\mathbf{X_{2}\beta_{2}} + (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}}\mathbf{\epsilon} Two elements are worthy of consideration.
1. If \mathbf{\beta_{2}}=0, everything is fine, assuming the standard assumptions hold; we have not really misspecified the model. 2. If the standard assumptions hold and \mathbf{X_{1}^{\prime}X_{2}}=0, then the second term also vanishes (even though \mathbf{\beta_{2}}\neq 0). If neither condition holds but we estimate the regression anyway, the estimate \mathbf{b_{1}} will be biased by (defining \mathbf{P}_{X_{1}X_{2}} = (\mathbf{X_{1}^{\prime}X_{1}})^{-1}\mathbf{X^{\prime}_{1}}\mathbf{X_{2}})

\mathbf{P}_{X_{1}X_{2}}\mathbf{\beta_{2}} What is \mathbf{P}_{X_{1}X_{2}}?
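A simulation makes the bias concrete (all numbers here are of my own choosing):

```r
# Omitted variable bias: x2 belongs in the model but is left out
set.seed(5)
n  <- 5000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)            # correlated with x1, so X1'X2 != 0
y  <- 1 + 1 * x1 + 2 * x2 + rnorm(n)

coef(lm(y ~ x1))["x1"]               # near 1 + 0.7 * 2 = 2.4, not the true 1
coef(lm(y ~ x1 + x2))["x1"]          # near the true 1
cov(x1, x2) / var(x1) * 2            # the bias term, P_{X1X2} * beta2 in one dimension
```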

Structural Consistency

In specifying a regression model, we assume that its assumptions apply equally well to all the observations in our sample. They may not. Fortunately, we can test claims of structural stability using techniques that we already have encountered. H_0 : Structural stability.

  1. Estimate a linear regression assuming away the structural instability. Save the Residual Sum of Squares, call it S_{1}.
  2. Estimate whatever regressions you believe to be implied by the hypothesis of structural instability and obtain their combined Residual Sum of Squares. Call it S_{4}.
  3. Subtract the RSS obtained from step 2 S_{4} from the RSS obtained in step 1 S_{1}. Call it S_{5}.
  4. F_{(k, n_{1} + n_{2} - 2k)} = \frac{S_{5} / k}{S_{4}/(n_{1} + n_{2} - 2k)}
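The four steps translate directly into R (simulated data with a deliberate break; S_1, S_4, and S_5 as defined above):

```r
# A Chow-style test of pooling against a known two-group split
set.seed(9)
n1 <- n2 <- 60; k <- 2                        # k counts the intercept and slope
x <- rnorm(n1 + n2)
g <- rep(0:1, c(n1, n2))
y <- ifelse(g == 0, 1 + 1 * x, 3 + 2 * x) + rnorm(n1 + n2)

S1 <- sum(resid(lm(y ~ x))^2)                       # step 1: pooled RSS
S4 <- sum(resid(lm(y ~ x, subset = g == 0))^2) +
      sum(resid(lm(y ~ x, subset = g == 1))^2)      # step 2: combined split RSS
S5 <- S1 - S4                                       # step 3
Fstat <- (S5 / k) / (S4 / (n1 + n2 - 2 * k))        # step 4
Fstat
pf(Fstat, k, n1 + n2 - 2 * k, lower.tail = FALSE)   # reject pooling here
```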

Revisiting \sigma^{2}I

Now we can move on to considering the properties of the residuals and their conformity with the assumptions we have made about them.

  1. Homoscedasticity
  2. Normality
    • Jarque-Bera test [for regression, n should be N-k]: JB = \frac{n}{6}\left(S^{2} + \frac{(K-3)^{2}}{4} \right), where S is the sample skewness and K the sample kurtosis
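JB can be computed by hand from regression residuals (a sketch on simulated data; S and K denote the sample skewness and kurtosis):

```r
# Jarque-Bera by hand for regression residuals
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 1 + x + rnorm(n)              # normal errors, so JB should be small
e <- resid(lm(y ~ x))

S  <- mean(e^3) / mean(e^2)^(3/2)  # sample skewness
K  <- mean(e^4) / mean(e^2)^2      # sample kurtosis (3 under normality)
df <- n - 2                        # N - k, as noted above
JB <- (df / 6) * (S^2 + (K - 3)^2 / 4)
JB                                 # compare to a chi-squared(2) critical value
pchisq(JB, df = 2, lower.tail = FALSE)
```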

Comparing Regression Models

Two types of models, in general:

  1. Nested models
  2. Nonnested models

In layman’s terms, nested models arise when one model is a special case of the other. For example, \mathbf{y} = \mathbf{\beta_{0}} + \mathbf{X_{1}\beta_{1}} + \mathbf{\epsilon} is nested in \mathbf{y} = \mathbf{\beta_{0}} + \mathbf{X_{1}\beta_{1}} + \mathbf{X_{2}\beta_{2}} + \mathbf{\epsilon} under the restriction that \mathbf{\beta_{2}} = 0. If models are nested, the usual techniques apply; if not, we must turn to alternative tools. Technically, there is an intermediate class appropriately named overlapping: overlapping models have some nested elements and some nonnested elements, and we will almost always need the nonnested tools for them.

Influence Diagnostics

  1. dfbeta
  2. Cook’s Distance
  3. Added-variable plots
  4. RESET

A Natural Conception of Panel Data as an Array

Consider i \in N units observed at t \in T points in time. The normal cross-sectional data structure will use variables as columns and i as rows. Panel data then adds the complication of a third dimension, t. If we were to take this third dimension and transform the array to two dimensions, we would end up with an (N \times T) by K matrix of covariates and (for a single outcome) an NT vector.

To Pool or Not to Pool?

  1. Virtues of Panel Data
  • More accurate inference and variety in asymptotics.
  • Control over complexity (Regressors and parameters).
  • Required to isolate short-run and long-run effects simultaneously. Policy and reaction functions?
  • The number of observations grows. More data can’t really provide less information, with all the Berk/Freedman caveats. Provable in straightforward fashion with important implications.
  • Explicit characterizations of within- and between- variation.
  • Simplification of computation for complex problems.
    • Non-stationary time series
    • Measurement error.
    • Computational tricks (dynamic tobit)

Examining Pooling Assumptions

  • Data (What is an outlier in this setting?)
    • D_{M} (x) = \sqrt{(x - \mu)^{\prime}\Sigma^{-1}(x - \mu)}: Mahalanobis distance is a generalization of Euclidean distance (\Sigma^{-1}=\Sigma=I) with an explicit covariance matrix. Ali Hadi’s work on multivariate outliers uses something similar, with reordering to maximize a two-subset Mahalanobis distance.
    • Jackknife summary statistics on one- or two-dimensions.
    • The real worry seems to be classes/clusters/groups that are different/distinct.
  • Models (Stability of a model? and Influence)
    • Chow test: F-test on pooled against split sample regression. Perhaps iterated Chow tests using combinatoric algorithms over sizes.
    • Changepoint models, Regimes and Regime Switching and Mixtures
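Base R’s mahalanobis() returns squared distances, so D_M is its square root (simulated data; MASS::mvrnorm used only to produce a correlated cloud):

```r
# Mahalanobis distance; reduces to Euclidean distance when Sigma = I
set.seed(4)
X  <- MASS::mvrnorm(200, mu = c(0, 0), Sigma = matrix(c(1, .8, .8, 1), 2))
mu <- colMeans(X); S <- cov(X)

d2 <- mahalanobis(X, center = mu, cov = S)  # squared distances
head(sqrt(d2))                              # D_M as defined above
# With the identity covariance, this is just Euclidean distance from mu
de <- sqrt(mahalanobis(X, center = mu, cov = diag(2)))
max(abs(de - sqrt(rowSums(sweep(X, 2, mu)^2))))
```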

Convenience Samples and the Like

  • Hsiao isolates many of the central issues in panel data from the view of an econometrician. The argument is a bit broader in the sense of repetition.

  • Berk and Freedman isolate important issues of particular relevance to the types of structures we will look at.

    • Convenience samples are a fact of social scientific life.
    • Treating the data as a population obviates inference.
    • As-if and imaginary sampling mechanisms. Uncertainty gets really hard.
    • Imaginary populations and imaginary sampling designs: a fiction?
    • What we do with them is our responsibility, but we should be fair.
    • Getting more data gives us the ability, but also the need, to do much more.

The Dimensions of TSCS/CSTS and Summary

  • The presence of a time dimension gives us a natural ordering.
  • Space is not irrelevant under the same circumstances as time – nominal indices are irrelevant on some level. Defining space is hard, e.g., the targeting of Foreign Direct Investment and defining proximity.
  • ANOVA is informative in this two-dimensional setting.
  • A part of any good data analysis is summary and characterization. The same is true here; let’s look at some examples of summary in panel data settings.

A Primitive Question

Given two-dimensional data, how should we break it down? The most common method is unit-averages; we break each unit’s time series on each element into deviations from their own mean. This is called the within transform. The between portion represents deviations between the unit’s mean and the overall mean. Stationarity considerations are generically implicit. We will break this up later.

Some Useful Variances and Notation

  • W(ithin) for unit i (Thus the total within variance would be a summary over all i \in N): W_{i} = \sum_{t=1}^{T} (x_{it} - \overline{x}_{i})^{2}
  • B(etween): B_{T} = \sum_{i=1}^{N} (\overline{x}_{i} - \overline{x})^{2}
  • T(otal): T = \sum_{i=1}^{N} \sum_{t=1}^{T} (x_{it} - \overline{x})^{2}
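With a balanced toy panel (base R, numbers of my own choosing), these quantities obey Total = Within + T \times Between, given the definitions above (note the T multiplying B):

```r
# Within, between, and total sums of squares for a balanced toy panel
set.seed(8)
N <- 4; Tt <- 5
i <- rep(1:N, each = Tt)
x <- rnorm(N * Tt) + i                 # unit-specific levels

xbar.i <- ave(x, i)                    # unit means, repeated T times
W   <- sum((x - xbar.i)^2)             # within
B   <- sum((tapply(x, i, mean) - mean(x))^2)  # between, over unit means
Tot <- sum((x - mean(x))^2)            # total
c(Tot, W + Tt * B)                     # equal in a balanced panel
```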

Basic xt commands

In Stata’s language, xt is the way that one naturally refers to CSTS/TSCS data. Consider NT observations on some random variable y_{it} where i \in N and t \in T. The TSCS/CSTS commands almost always have this prefix.

  • xtset: Declaring xt data
  • xtdes: Describing xt data structure
  • xtsum: Summarizing xt data
  • xttab: Summarizing categorical xt data.
  • xttrans: Transition matrix for xt data.

Basic R commands

library(haven)
HR.Data <- read_dta(url("https://github.com/robertwwalker/DADMStuff/raw/master/ISQ99-Essex.dta"))
library(skimr)
library(knitr)      # for kable()
library(kableExtra) # for scroll_box()
skim(HR.Data) %>% kable() %>% scroll_box(width="80%", height="50%")
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric IDORIGIN 0 1.0000000 446.7178771 243.1931782 2.00 290.000 435.000 640.00 990.00 ▆▇▇▆▂
numeric YEAR 0 1.0000000 1984.5000000 5.1889328 1976.00 1980.000 1984.500 1989.00 1993.00 ▇▆▇▆▇
numeric AI 1061 0.6707014 2.7533549 1.0752989 1.00 2.000 3.000 3.00 5.00 ▃▇▇▃▂
numeric SD 587 0.8178150 2.2406072 1.1303528 1.00 1.000 2.000 3.00 5.00 ▇▇▆▂▁
numeric POLRT 382 0.8814401 3.8095070 2.2230297 1.00 2.000 3.000 6.00 7.00 ▇▂▂▁▇
numeric MIL2 382 0.8814401 0.2725352 0.4453421 0.00 0.000 0.000 1.00 1.00 ▇▁▁▁▃
numeric LEFT 393 0.8780261 0.1763874 0.3812168 0.00 0.000 0.000 0.00 1.00 ▇▁▁▁▂
numeric BRIT 290 0.9099938 0.3553888 0.4787126 0.00 0.000 0.000 1.00 1.00 ▇▁▁▁▅
numeric PCGNP 443 0.8625078 3591.6509536 5698.3554010 52.00 390.000 1112.000 3510.00 36670.00 ▇▁▁▁▁
numeric AINEW 468 0.8547486 2.4433551 1.1558005 1.00 1.000 2.000 3.00 5.00 ▇▇▇▃▂
numeric SDNEW 468 0.8547486 2.2618010 1.1365604 1.00 1.000 2.000 3.00 5.00 ▇▇▆▂▁
numeric IDGURR 0 1.0000000 455.7709497 246.5201369 2.00 290.000 450.000 663.00 990.00 ▆▇▇▇▃
numeric AILAG 644 0.8001241 2.4499612 1.1479673 1.00 1.000 2.000 3.00 5.00 ▇▇▇▃▂
numeric SDLAG 644 0.8001241 2.2470908 1.1156632 1.00 1.000 2.000 3.00 5.00 ▇▇▆▂▁
numeric PERCHPCG 618 0.8081937 4.6138441 13.2208934 -95.50 -2.545 4.615 11.76 128.57 ▁▂▇▁▁
numeric PERCHPOP 293 0.9090627 2.1928815 4.0424128 -48.45 0.910 2.220 2.94 126.01 ▁▇▁▁▁
numeric LPOP 115 0.9643079 15.4819279 1.8633316 11.00 14.510 15.590 16.64 20.89 ▂▃▇▃▁
numeric PCGTHOU 443 0.8625078 3.5916985 5.6983334 0.05 0.390 1.110 3.51 36.67 ▇▁▁▁▁
numeric DEMOC3 793 0.7538796 3.6817620 4.3577178 0.00 0.000 0.000 9.00 10.00 ▇▁▁▂▃
numeric CWARCOW 407 0.8736809 0.0920071 0.2890873 0.00 0.000 0.000 0.00 1.00 ▇▁▁▁▁
numeric IWARCOW2 380 0.8820608 0.0862069 0.2807187 0.00 0.000 0.000 0.00 1.00 ▇▁▁▁▁

More R summary

library(tidyverse)
library(plm)
source(url("https://raw.githubusercontent.com/robertwwalker/DADMStuff/master/xtsum/xtsum.R"))
# Be careful with the ID variable, the safest is to make it factor; this can go wildly wrong
xtsum(IDORIGIN~., data=HR.Data) %>% kable() %>% scroll_box(width="80%", height="50%")
O.mean O.sd O.min O.max O.SumSQ O.N B.mean B.sd B.min B.max B.Units B.t.bar W.sd W.min W.max W.SumSQ Within.Ovr.Ratio
YEAR 1984.5 5.189 1976 1993 86725.5 3222 1984.5 0 1984.5 1984.5 179 18 5.189 -8.5 8.5 86725.5 1
AI 2.753 1.075 1 5 2497.538 2161 2.498 0.989 1 5 173 12.491 0.631 -2.375 2.5625 860.822 0.345
SD 2.241 1.13 1 5 3365.455 2635 2.241 1.004 1 5 178 14.803 0.624 -2.666667 3.0625 1025.695 0.305
POLRT 3.81 2.223 1 7 14029.94 2840 3.78 1.99 1 7 179 15.866 0.925 -4 4.777778 2428.552 0.173
MIL2 0.273 0.445 0 1 563.058 2840 0.24 0.377 0 1 179 15.866 0.216 -0.9444444 0.8888889 132.778 0.236
LEFT 0.176 0.381 0 1 410.983 2829 0.157 0.334 0 1 179 15.804 0.157 -0.8888889 0.8888889 69.611 0.169
BRIT 0.355 0.479 0 1 671.685 2932 0.335 0.473 0 1 179 16.38 0 0 0 0 0
PCGNP 3591.651 5698.355 52 36670 90205144379 2779 3449.178 5049.297 112.2222 22653.89 173 16.064 2278.412 -12303.33 16961.67 14421042273 0.16
AINEW 2.443 1.156 1 5 3677.663 2754 2.379 1.012 1 5 178 15.472 0.622 -2.388889 2.944444 1064.102 0.289
SDNEW 2.262 1.137 1 5 3556.241 2754 2.253 1.006 1 5 178 15.472 0.631 -2.588235 3 1096.442 0.308
IDGURR 455.771 246.52 2 990 195747185 3222 455.771 247.173 2 990 179 18 0 0 0 0 0
AILAG 2.45 1.148 1 5 3396.045 2578 2.402 1.039 1 5 177 14.565 0.609 -2.411765 3 955.37 0.281
SDLAG 2.247 1.116 1 5 3207.603 2578 2.236 0.991 1 5 177 14.565 0.608 -2.5 3.058824 952.174 0.297
PERCHPCG 4.614 13.221 -95.5 128.57 454983.6 2604 3.325 6.893 -36.21333 15.03765 168 15.5 12.393 -92.50235 114.8882 399763 0.879
PERCHPOP 2.193 4.042 -48.45 126.01 47846.75 2929 2.842 9.443 -2.126471 126.01 176 16.642 3.018 -48.12235 80.69765 26663.59 0.557
LPOP 15.482 1.863 11 20.89 10784.05 3107 15.488 1.844 11.09056 20.76889 177 17.554 0.129 -0.7288889 0.7311111 51.883 0.005
PCGTHOU 3.592 5.698 0.05 36.67 90204.45 2779 3.449 5.049 0.1122222 22.65389 173 16.064 2.278 -12.30333 16.96167 14420.95 0.16
DEMOC3 3.682 4.358 0 10 46107 2429 3.774 3.96 0 10 155 15.671 1.726 -7.277778 7.941176 7229.815 0.157
CWARCOW 0.092 0.289 0 1 235.17 2815 0.095 0.245 0 1 179 15.726 0.175 -0.8888889 0.9444444 85.693 0.364
IWARCOW2 0.086 0.281 0 1 223.879 2842 0.092 0.227 0 1 179 15.877 0.19 -0.8888889 0.9444444 102.992 0.46

The Core Idea

In R, this is essentially a group_by calculation in the tidyverse. The within data are the overall data with group means subtracted.

HR.Data %>% 
  group_by(IDORIGIN) %>% 
  mutate(DEMOC.Centered = 
           DEMOC3 - mean(DEMOC3, na.rm=TRUE)) %>%
  filter(IDORIGIN==42) %>% 
  select(IDORIGIN, YEAR, DEMOC3, DEMOC.Centered) 
# A tibble: 18 × 4
# Groups:   IDORIGIN [1]
   IDORIGIN  YEAR DEMOC3 DEMOC.Centered
      <dbl> <dbl>  <dbl>          <dbl>
 1       42  1976      1         -5.11 
 2       42  1977      1         -5.11 
 3       42  1978      6         -0.111
 4       42  1979      6         -0.111
 5       42  1980      6         -0.111
 6       42  1981      6         -0.111
 7       42  1982      7          0.889
 8       42  1983      7          0.889
 9       42  1984      7          0.889
10       42  1985      7          0.889
11       42  1986      7          0.889
12       42  1987      7          0.889
13       42  1988      7          0.889
14       42  1989      7          0.889
15       42  1990      7          0.889
16       42  1991      7          0.889
17       42  1992      7          0.889
18       42  1993      7          0.889

HR.Data %>% 
  group_by(IDORIGIN) %>% 
  mutate(DEMOC.Centered = DEMOC3 - mean(DEMOC3, na.rm=TRUE)) %>%
  DT::datatable(fillContainer = FALSE, options = list(pageLength = 8))

Outline for Day 2

Big picture: Models for Single Time Series

  • Stationarity and differencing
  • Spurious regressions: Yule (1926)
  • Autoregressive and moving average terms.
  • Unit-root testing
  • Event Studies
  • The model/substance interaction